This project analyzes customer churn data to identify the key factors influencing whether a customer leaves a telecom service. Using skills in data wrangling, exploratory data analysis (EDA), and statistical correlation, the project uncovers that features like contract type, tenure, and payment method are highly associated with customer retention. Visualization, feature interpretation, and data storytelling are applied to translate patterns into actionable insights for improving customer retention strategies.
The interactive dashboard below allows you to explore the data, visualize patterns, and even simulate customer scenarios to predict churn probability based on various attributes.
Customer churn measures the loss of existing customers who stop doing business with a company or stop using its service, relative to the total number of customers over a given period. Analyzing churn is important because it helps a business understand why customers leave, and improving retention builds brand loyalty and increases overall customer satisfaction and profitability. While the churn rate itself is easy to calculate, accurately predicting which customers will churn is much harder.
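For reference, the churn-rate formula itself is simple. A minimal sketch in Python, using made-up placeholder numbers rather than figures from this dataset:

# Churn rate over a period = customers lost / customers at the start of the period
# The figures below are illustrative placeholders, not values from the Telco dataset
customers_at_start = 1000
customers_lost = 50

churn_rate = customers_lost / customers_at_start * 100
print(f"Churn rate: {churn_rate:.1f}%")  # 5.0%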
The dataset I will be using comes from a telecommunications company that provides home phone and internet services to 7,043 customers in California.
The dataset includes information about:
- Whether the customer left within the last month (the Churn column)
- The services each customer signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Account information: tenure, contract type, payment method, paperless billing, monthly charges, and total charges
- Demographics: gender, senior citizen status, and whether the customer has a partner or dependents
In this project I will analyze the different factors that affect customer churn by creating regression models to identify correlations, as well as building a survival analysis model. I also create a prediction model using classification machine learning to accurately predict the likelihood that a customer will churn.
The data comes from Kaggle and can be accessed here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from lifelines.plotting import plot_lifetimes
from lifelines import KaplanMeierFitter

import statsmodels.api as sm

from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             accuracy_score, classification_report, log_loss,
                             roc_curve, auc)
from sklearn.pipeline import Pipeline

import xgboost as xgb
from xgboost import XGBClassifier
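The cells below assume the Kaggle CSV has already been read into a DataFrame named df. The exact loading cell is not shown here; a minimal version might look like the following sketch, where the file name is an assumption and TotalCharges (which the raw Kaggle file stores as text, with blanks for brand-new customers) is coerced to numeric so the later charge plots behave as expected.

# Load the Telco churn data (file name assumed; adjust to the downloaded Kaggle CSV)
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# TotalCharges comes in as text in the raw file; coerce to numeric and drop the
# handful of rows left blank for brand-new customers
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.dropna(subset=['TotalCharges'])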
# Frequency table for contract type
contracttype_churn_counts = df.groupby(['Churn', 'Contract']).size().unstack(fill_value=0)
print(contracttype_churn_counts)
print("\n")

# Normalized (row-percentage) frequency table for contract type
contract_table_percent = contracttype_churn_counts.div(contracttype_churn_counts.sum(axis=1), axis=0) * 100
print(contract_table_percent)

# Overall frequency for each payment method
payment_stats = df['PaymentMethod'].value_counts(normalize=True) * 100
print(payment_stats)

# Payment method counts split by churn
payment_churn_counts = df.groupby('Churn')['PaymentMethod'].value_counts().unstack()
print(payment_churn_counts)
print('\n')

# Percentage of each payment method within the churn / no-churn groups
payment_churn_percentage = df.groupby('Churn')['PaymentMethod'].value_counts(normalize=True).mul(100).unstack()
print(payment_churn_percentage)

# Frequency tables for each telecommunication service
df = df.rename(columns={
    'PhoneService': 'Phone Service',
    'MultipleLines': 'Multiple Lines',
    'InternetService': 'Internet Service',
    'OnlineSecurity': 'Online Security',
    'OnlineBackup': 'Online Backup',
    'DeviceProtection': 'Device Protection',
    'TechSupport': 'Tech Support',
    'StreamingTV': 'Streaming TV',
    'StreamingMovies': 'Streaming Movies',
})
service = ['Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security',
           'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
           'Streaming Movies']

def generate_service_frequency_by_churn(df, service):
    for col in service:
        print(f"\nFrequency Table for '{col}' (Grouped by Churn):")
        print(df.groupby('Churn')[col].value_counts())  # Raw counts
        print("\nPercentage Distribution by Churn:")
        print(df.groupby('Churn')[col].value_counts(normalize=True).mul(100).round(2))
        print("-" * 60)

generate_service_frequency_by_churn(df, service)

# Frequency table for each demographic
df = df.rename(columns={
    'gender': 'Gender',
    'SeniorCitizen': 'Senior Citizen',
})
df['Senior Citizen'] = df['Senior Citizen'].replace({0: 'No', 1: 'Yes'})
demographics = ['Gender', 'Senior Citizen', 'Partner', 'Dependents']

def generate_demographic_frequency_by_churn(df, demographics):
    for col in demographics:
        print(f"\nFrequency Table for '{col}' (Grouped by Churn):")
        print(df.groupby('Churn')[col].value_counts())
        print("\nPercentage Distribution by Churn:")
        print(df.groupby('Churn')[col].value_counts(normalize=True).mul(100).round(2))
        print("-" * 60)

generate_demographic_frequency_by_churn(df, demographics)
# KDE plot of monthly charges by churn
sns.kdeplot(df.MonthlyCharges[df["Churn"] == 'No'], fill=True, label="No Churn")
sns.kdeplot(df.MonthlyCharges[df["Churn"] == 'Yes'], fill=True, label="Churn")
plt.title('Monthly Charges by Churn (KDE Plot)')
plt.xlabel('Monthly Charges')
plt.ylabel('Density')
plt.legend()
plt.show()

# KDE plot of total charges by churn
sns.kdeplot(df.TotalCharges[df["Churn"] == 'No'], fill=True, label="No Churn")
sns.kdeplot(df.TotalCharges[df["Churn"] == 'Yes'], fill=True, label="Churn")
plt.title('Total Charges by Churn (KDE Plot)')
plt.xlabel('Total Charges')
plt.ylabel('Density')
plt.legend()
plt.show()

# Violin plot of tenure by churn
sns.violinplot(data=df, x='Churn', y='tenure')
plt.title('Tenure by Churn')
plt.xlabel('Churn')
plt.ylabel('Tenure')
plt.show()
Customers who churned had higher monthly charges on average. You can see the orange curve peaking around 70–100 USD. Customers who did not churn have two notable clusters: one peak at low monthly charges (around 20 USD) and another smaller one around 70–90 USD. This smooth curve helps you see general distribution trends and compare how spread out or concentrated the values are.
Non-churners have spent much more over time, which makes sense since they have stayed longer. The large gap between the median total charges for churners (703 USD) and non-churners (1,679 USD) shows that churners often leave before investing much. This plot is very similar to the monthly charges plot but takes tenure into account. Churned customers (orange) are clustered at low total charges, typically under 2,000 USD, with a sharp peak very early. Non-churned customers (blue) are more widely distributed, with a long tail reaching 8,000–9,000 USD, suggesting they have been with the company longer.
The tenure distribution for non-churned customers is fairly even across the full range, while churned customers are heavily concentrated around 0-10 months with a clear right skew. This shows that those who churn have typically used the service for only a short time. Overall, customers who churn pay higher monthly charges but accumulate lower total charges because they leave after a shorter tenure, which suggests the company may struggle to retain customers early in the relationship. There is also considerable customer loyalty, since the tenure of non-churned customers is almost double that of those who churn.
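As a quick sanity check on the medians quoted above (roughly 703 USD of total charges for churners versus 1,679 USD for non-churners), the group medians can be pulled straight from the data; this assumes TotalCharges is already numeric:

# Median monthly charges, total charges, and tenure for churned vs. retained customers
print(df.groupby('Churn')[['MonthlyCharges', 'TotalCharges', 'tenure']].median())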
# Bar chart of contract type by churn
contracttype_churn_counts.plot(kind='bar', figsize=(8, 6))
plt.title('Contract Type by Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(['Month-to-Month', 'One Year', 'Two Year'], loc='upper right')
plt.show()

# Bar chart of payment method by churn
payment_churn_counts.plot(kind='bar', figsize=(8, 6))
plt.title('Payment Method by Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(['Bank transfer (automatic)', 'Credit card', 'Electronic check', 'Mailed check'],
           loc='upper right')
plt.show()

# Count plots for each telecommunication service by churn
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
axes = axes.flatten()
for i, col in enumerate(service):
    sns.countplot(data=df, x='Churn', hue=col, ax=axes[i])
    axes[i].set_title(f"{col} by Churn")
    axes[i].set_xlabel("Churn")
    axes[i].set_ylabel("Count")
    axes[i].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

# Count plots for each demographic by churn
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12))  # 2x2 grid
axes = axes.flatten()
for i, col in enumerate(demographics):
    sns.countplot(data=df, x='Churn', hue=col, ax=axes[i])
    axes[i].set_title(f"{col} by Churn")
    axes[i].set_xlabel("Churn")
    axes[i].set_ylabel("Count")
    axes[i].tick_params(axis='x', rotation=0)
plt.tight_layout()
plt.show()
# Survival analysis using the Telco data
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

durations = df['tenure']
event_observed = df['Churn']

km = KaplanMeierFitter()
km.fit(durations, event_observed, label='Kaplan Meier Estimate')
km.plot()

def time_at_survival_threshold(kmf, threshold):
    sf = kmf.survival_function_
    return sf[sf[kmf._label] <= threshold].index.min()

thresholds = [0.8, 0.65]
colors = ['blue', 'green']
for thresh, color in zip(thresholds, colors):
    time = time_at_survival_threshold(km, thresh)
    if pd.notna(time):
        plt.axhline(thresh, color=color, linestyle='dashed')
        plt.axvline(time, color=color, linestyle='dashed')
        plt.text(time + 1, thresh + 0.02, f"{int(thresh*100)}% survival {time} months",
                 color=color, fontsize=10)

plt.title('Survival Curve using Telco data')
plt.ylabel('Likelihood of Survival')

# Survival analysis by payment method
kmf_ch1 = KaplanMeierFitter()
T1 = df.loc[df['PaymentMethod'] == 'Bank transfer (automatic)', 'tenure']
E1 = df.loc[df['PaymentMethod'] == 'Bank transfer (automatic)', 'Churn']
kmf_ch1.fit(T1, E1, label='Bank transfer')
ax = kmf_ch1.plot(ci_show=False)

kmf_ch2 = KaplanMeierFitter()
T2 = df.loc[df['PaymentMethod'] == 'Credit card (automatic)', 'tenure']
E2 = df.loc[df['PaymentMethod'] == 'Credit card (automatic)', 'Churn']
kmf_ch2.fit(T2, E2, label='Credit card')
ax = kmf_ch2.plot(ci_show=False)

kmf_ch3 = KaplanMeierFitter()
T3 = df.loc[df['PaymentMethod'] == 'Electronic check', 'tenure']
E3 = df.loc[df['PaymentMethod'] == 'Electronic check', 'Churn']
kmf_ch3.fit(T3, E3, label='Electronic check')
ax = kmf_ch3.plot(ci_show=False)

kmf_ch4 = KaplanMeierFitter()
T4 = df.loc[df['PaymentMethod'] == 'Mailed check', 'tenure']
E4 = df.loc[df['PaymentMethod'] == 'Mailed check', 'Churn']
kmf_ch4.fit(T4, E4, label='Mailed check')
ax = kmf_ch4.plot(ci_show=False)

plt.title("Churn Duration based on Payment Method Survival Curve")
plt.ylabel('Likelihood of Survival');

# Survival analysis by contract type
kmf_ch1 = KaplanMeierFitter()
T1 = df.loc[df['Contract'] == 'Month-to-month', 'tenure']
E1 = df.loc[df['Contract'] == 'Month-to-month', 'Churn']
kmf_ch1.fit(T1, E1, label='Month-to-month')
ax = kmf_ch1.plot(ci_show=False)

kmf_ch2 = KaplanMeierFitter()
T2 = df.loc[df['Contract'] == 'One year', 'tenure']
E2 = df.loc[df['Contract'] == 'One year', 'Churn']
kmf_ch2.fit(T2, E2, label='One year')
ax = kmf_ch2.plot(ci_show=False)

kmf_ch3 = KaplanMeierFitter()
T3 = df.loc[df['Contract'] == 'Two year', 'tenure']
E3 = df.loc[df['Contract'] == 'Two year', 'Churn']
kmf_ch3.fit(T3, E3, label='Two year')
ax = kmf_ch3.plot(ci_show=False)

plt.title("Churn Duration based on Contract Term Survival Curve")
plt.ylabel('Likelihood of Survival');
The month-to-month contract has a much lower likelihood of survival: the curve reaches roughly 0% at around 70 months and crosses the 50% survival probability at around 35 months. The two-year contract, by contrast, has a very high likelihood of survival, since customers who commit to two years are likely satisfied with the service and unlikely to churn. The one-year contract drops sharply after the 60-month mark, and its long-run survival probability ends up at roughly half that of the two-year contract.
This outcome is very similar to the bar chart that compared churn and no churn by payment method. Bank transfer and credit card have very similar survival curves. The mailed check curve is slightly lower, which might be because the payment is manual, so customers may be quicker to drop the service. The electronic check is manual in the same way, but shows a much higher proportion of churned customers.
There is an 80% probability of survival beyond about 22 months and a 65% probability of survival beyond about 65 months.
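To put numbers on these curves, the fitted Kaplan-Meier objects expose the median survival time, i.e. the tenure at which the estimated survival probability drops to 50%. A short check along these lines, reusing the estimators fitted in the contract-type cell above:

# Median survival time in months for each contract type; lifelines returns inf
# when a group's curve never falls below 0.5 (as with the longer contracts)
print("Month-to-month:", kmf_ch1.median_survival_time_)
print("One year:", kmf_ch2.median_survival_time_)
print("Two year:", kmf_ch3.median_survival_time_)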
# Create correlation table
df_copy = df.copy()
label_encoder = LabelEncoder()
for col in df_copy.columns:
    df_copy[col] = label_encoder.fit_transform(df_copy[col])

correlation_matrix = df_copy.corr()
churn_correlation = correlation_matrix['Churn'].sort_values(ascending=False)
print(churn_correlation)

# Visualize with a heat map
plt.figure(figsize=(15, 12))  # Adjust figure size
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5,
            vmin=-1, vmax=1)
plt.title('Feature Correlation Heatmap')
plt.show()
Monthly Charges (0.183523): Of the variables analyzed, Monthly Charges has the strongest positive correlation with churn. Customers may churn because their monthly charges are too high, possibly due to bundles that include services the customer doesn't actually use.
Senior Citizen (0.150889): Being a senior citizen is positively associated with churning. Senior citizens may be less tech savvy, need fewer subscriptions, and be more price sensitive, so they may have less need for the telecommunication services.
Contract (-0.396713): Customers on one-year or two-year contracts are much less likely to cancel, since the contract runs for its full term and cancelling early may incur fees. Customers on longer contracts may also receive discounts and bundles that incentivize them to stay.
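The contract effect can also be checked directly as a raw churn rate per contract type. Assuming Churn has already been mapped to 0/1 (as in the survival-analysis cell), the proportions are a one-liner:

# Proportion of customers who churned within each contract type
print(df.groupby('Contract')['Churn'].mean().round(3))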
# Define features and target
X = df_copy.drop(columns=['Churn'])
y = df_copy['Churn']
X = pd.get_dummies(X, drop_first=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)
y_pred_RF = random_forest.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred_RF)
print(f"Accuracy: {accuracy * 100}%")

# Confusion matrix for Random Forest
confusion_matrix_RF = metrics.confusion_matrix(y_test, y_pred_RF)
class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# Create heatmap
sns.heatmap(pd.DataFrame(confusion_matrix_RF), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Random Forest Classifier', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

# ROC curve from predicted probabilities
y_probs = random_forest.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test, y_probs)
auc_rf = metrics.roc_auc_score(y_test, y_probs)
plt.figure()  # start a new figure so the ROC curve is not drawn over the heatmap
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (AUC = {auc_rf:.3f})")
plt.plot([0, 1], [0, 1], 'k--', label="Random guess")  # diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve', fontsize=16)
plt.legend(loc='lower right')
plt.show()
Metrics you can infer:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate, showing how well the model separates the positive and negative classes. The curve sits well above the dashed diagonal line (which represents random guessing), indicating that the model performs much better than chance. The area under the ROC curve (AUC) appears quite good, around 0.85–0.9, suggesting strong discriminative ability.
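Beyond accuracy and the ROC curve, the per-class precision and recall discussed in the conclusion (including the roughly 47% recall on churners) can be read off a classification report; classification_report is already imported above:

# Per-class precision, recall, and F1 score for the Random Forest predictions
print(classification_report(y_test, y_pred_RF, target_names=['No churn', 'Churn']))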
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
logisticRegr = LogisticRegression(max_iter=500)
logisticRegr.fit(X_train_scaled, y_train)

# Model evaluation
accuracy = logisticRegr.score(X_test_scaled, y_test)
y_pred_log = logisticRegr.predict(X_test_scaled)
print(f"Accuracy: {accuracy * 100}%")

# Fit a statsmodels logistic regression for coefficient estimates and p-values
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())

# Confusion matrix heat map for Logistic Regression
confusion_matrix_log = metrics.confusion_matrix(y_test, y_pred_log)
class_names = [0, 1]  # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# Create heatmap
sns.heatmap(pd.DataFrame(confusion_matrix_log), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Logistic Regression', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Metrics you can infer:
The table shows the estimated coefficients, standard errors, and significance levels for each predictor in the logistic regression model. Key factors negatively associated with churn include longer contract terms, tech support, and online security services. Conversely, features like paperless billing and higher monthly charges show a positive association with churn. Variables such as gender and payment method were not statistically significant predictors of churn.
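One way to make the coefficients easier to interpret is to exponentiate them into odds ratios, where values below 1 reduce the odds of churning and values above 1 increase them. A small follow-up on the fitted statsmodels result:

# Convert log-odds coefficients into odds ratios with 95% confidence intervals
odds_ratios = pd.DataFrame({
    'Odds Ratio': np.exp(result.params),
    'CI Lower': np.exp(result.conf_int()[0]),
    'CI Upper': np.exp(result.conf_int()[1]),
})
print(odds_ratios.sort_values('Odds Ratio'))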
The Random Forest model achieved an overall accuracy of approximately 78.5%, as indicated by the confusion matrix. While it performed well at identifying non-churners, it struggled to correctly classify churners, reflected in a relatively low recall for the positive class (around 47.2%). The Logistic Regression model, though slightly lower in raw predictive performance, offered better interpretability and highlighted statistically significant drivers of churn such as contract type, tech support, and online security services. After refining the feature set and tuning hyperparameters, the Logistic Regression model's accuracy improved and approached that of the Random Forest, with the added benefits of clarity and actionable business insight. Ultimately, while the Random Forest may slightly outperform in predictive power, the improved Logistic Regression model closes much of that gap and excels in explainability, making it a strong choice for stakeholder-facing use and policy decisions.
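The tuning step mentioned above is not shown in the cells here; one possible setup, using the Pipeline and cross-validation utilities already imported, is sketched below. The parameter grid is an assumption for illustration, not the grid actually searched in this project.

from sklearn.model_selection import GridSearchCV

# Scale inside the pipeline so each cross-validation fold is scaled independently
logit_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
])

# Illustrative grid; the values actually searched in the project may differ
param_grid = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__penalty': ['l2'],
}

grid_search = GridSearchCV(logit_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Cross-validated accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {grid_search.score(X_test, y_test):.4f}")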